Abstract

We consider the linear regression problem. We propose the S-Lasso procedure
to estimate the unknown regression parameters. This estimator enjoys sparsity
of the representation while taking into account correlation between successive
covariates (or predictors). The study covers the case when $p \gg n$, i.e. the
number of covariates is much larger than the number of observations. From a
theoretical point of view, for fixed $p$, we establish asymptotic normality and
consistency in variable selection for our procedure. When $p > n$, we provide
variable selection consistency results and show that the S-Lasso achieves a
Sparsity Inequality, i.e., a bound in terms of the number of non-zero components
of the oracle vector. It appears that the S-Lasso has nice variable selection
properties compared to its challengers. Furthermore, we provide an estimator of
the effective degree of freedom of the S-Lasso estimator. A simulation study
shows that the S-Lasso performs better than the Lasso as far as variable
selection is concerned, especially when high correlations between successive
covariates exist. This procedure also appears to be a good challenger to the
Elastic-Net [36].
1 Introduction



We focus on the usual linear regression model: 



\[
y_i = x_i \beta^* + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1}
\]



where the design $x_i = (x_{i,1}, \dots, x_{i,p}) \in \mathbb{R}^p$ is deterministic,
$\beta^* = (\beta_1^*, \dots, \beta_p^*)' \in \mathbb{R}^p$ is the unknown parameter and
$\varepsilon_1, \dots, \varepsilon_n$ are independent identically distributed (i.i.d.)
centered Gaussian random variables with known variance $\sigma^2$. We wish to
estimate $\beta^*$ in the sparse case, that is when many of its unknown components equal
zero. Thus only a subset of the design covariates $(\xi_j)_j$ is truly of interest, where
$\xi_j = (x_{1,j}, \dots, x_{n,j})'$, $j = 1, \dots, p$. Moreover the case $p \gg n$ is not
excluded, so that we can consider $p$ depending on $n$. In such a framework, two main
issues arise: i) the interpretability of the resulting prediction; ii) the control of the
variance in the estimation. Regularization is therefore needed. For this purpose we use
selection type procedures of the following form:

\[
\tilde{\beta} = \operatorname*{Argmin}_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + \mathrm{pen}(\beta) \right\}, \tag{2}
\]

where $X = (x_1', \dots, x_n')'$, $Y = (y_1, \dots, y_n)'$ and
$\mathrm{pen} : \mathbb{R}^p \to \mathbb{R}$ is a positive convex function called the
penalty. For any vector $a = (a_1, \dots, a_n)'$, we have adopted the notation
$\|a\|_n^2 = n^{-1} \sum_{i=1}^n |a_i|^2$ (we denote by $\langle \cdot, \cdot \rangle_n$ the
corresponding inner product in $\mathbb{R}^n$). The choice of the penalty appears to be
crucial. Although well-suited for variable selection purposes, concave-type penalties
(pJ2j, [27] and [6]) are often computationally hard to optimize. Lasso-type procedures
(modifications of the $\ell_1$ penalized least squares (Lasso) estimator introduced by
Tibshirani [25]) have been extensively studied during the last few years. Among many
others, see [2 01 [34] and references therein. Such procedures seem to meet our
objective as they perform both regression parameter estimation and variable selection
at low computational cost. We will explore this type of procedure in our study.

In the paper, we propose a novel modification of the Lasso that we call the Smooth-lasso
(S-lasso) estimator. It is defined as the solution of the optimization problem (2)
when the penalty function is a combination of the Lasso penalty (i.e.,
$\sum_{j=1}^p |\beta_j|$) and the $\ell_2$-fusion penalty (i.e.,
$\sum_{j=2}^p (\beta_j - \beta_{j-1})^2$). The $\ell_2$-fusion penalty was first
introduced in [15]. We add it to the Lasso procedure in order to overcome the variable
selection problems encountered by the Lasso estimator. Indeed the Lasso estimator has
good selection properties but fails in some situations. More precisely, in several
works ([2 HH CEH EH [32], [34], [35] among others) conditions for the consistency in
variable selection of the Lasso procedure are given. It was shown that the Lasso is
not consistent when high correlations exist between the covariates. We give similar
consistency conditions for the S-Lasso procedure and show that it is consistent in
variable selection in many more situations than the Lasso estimator. From a practical
point of view, problems are also encountered when we solve the Lasso criterion with
the Lasso modification of the LARS algorithm [10]. Indeed this algorithm tends to
select only one representative covariate in each group of correlated covariates. We
attempt to respond to this problem in the case where the covariates are ranked so
that high correlations can exist between successive covariates. We will see through
simulations that such situations support the use of the S-lasso estimator. This
estimator is inspired by the Fused-Lasso [26]. Both S-Lasso and Fused-Lasso combine
an $\ell_1$-penalty with a fusion term [15]. The fusion term is meant to capture
correlations between covariates. More relevant covariates can then be selected thanks
to the correlations between them. The main difference between the two procedures is
that we use the $\ell_2$ distance between the successive coefficients (i.e., the
$\ell_2$-fusion penalty) whereas the Fused-Lasso uses the $\ell_1$ distance (i.e., the
$\ell_1$-fusion penalty: $\sum_{j=2}^p |\beta_j - \beta_{j-1}|$). Hence, compared to
the Fused-Lasso, we sacrifice sparsity between successive coefficients in the
estimation of $\beta^*$ in favor of an easier optimization due to the strict convexity
of the $\ell_2$ distance. However, sparsity is still ensured by the Lasso penalty,
while the $\ell_2$-fusion penalty helps us to capture correlations between covariates.
Consequently, even if there is no perfect match between successive coefficients, our
results are still interpretable. Moreover, when successive coefficients are
significantly different, a perfect match does not seem really appropriate. From a
theoretical point of view, the $\ell_2$ distance also helps us to provide theoretical
properties for the S-Lasso, which in some situations appears to outperform the Lasso
and the Elastic-Net [36], another Lasso-type procedure. Let us mention that variable
selection consistency of the Fused-Lasso and the corresponding Fused adaptive Lasso has
also been studied in [20], but in a different context from the one in the present
paper. The results obtained in [20] are established not only under the sparsity
assumption, but the model is also supposed to be blocky, that is the non-zero
coefficients are represented in a block fashion with equal values inside each block.

Many techniques have been proposed to address the weaknesses of the Lasso. The
Fused-Lasso procedure is one of them, and we mention here some of the most popular
methods. The Adaptive Lasso, introduced in [35], is similar to the Lasso but uses
adaptive weights to penalize each regression coefficient separately. This procedure
reaches the 'Oracle Properties' (i.e. consistency in variable selection and asymptotic
normality). Another approach is used in the Relaxed Lasso [17] and aims to
doubly control the Lasso estimate: one parameter controls variable selection and the
other controls shrinkage of the selected coefficients. To overcome the problem due to
the correlation between covariates, group variable selection has been proposed by Yuan
and Lin [31] with the Group-Lasso procedure, which selects groups of correlated
covariates instead of single covariates at each step. A first step towards the
consistency study has been proposed in [T] and Sparsity Inequalities were given in [5].
Another choice of penalty has been proposed with the Elastic-Net [36]. It is in the
same spirit that we shall treat the S-Lasso from a theoretical point of view.

The paper is organized as follows. In the next section, we present one way to
solve the S-Lasso problem with the attractive property of piecewise linearity of its
regularization path. Section 3 gives theoretical performances of the considered
estimator such as consistency in variable selection and asymptotic normality when
p < n, whereas consistency in estimation and variable selection in the high
dimensional case are considered in Section 4. We also give an estimate of the
effective degree of freedom of the S-Lasso estimator in Section 5. Then, we provide a
way to control the variance of the estimator by scaling in Section 6, where a
connection with soft-thresholding is also established. A generalization and comparative
study to the Elastic-Net is done in Section 7. We finally give experimental results in
Section 8 showing the S-Lasso performances against some popular methods. All proofs
are postponed to an Appendix section.

2 The S-Lasso procedure 

As described above, we define the S-Lasso estimator $\hat{\beta}^{SL}$ as the solution
of the optimization problem (2) when the penalty function is:

\[
\mathrm{pen}(\beta) = \lambda |\beta|_1 + \mu \sum_{j=2}^{p} (\beta_j - \beta_{j-1})^2, \tag{3}
\]

where $\lambda$ and $\mu$ are two positive parameters that control the smoothness of our
estimator. For any vector $a = (a_1, \dots, a_p)'$, we have used the notation
$|a|_1 = \sum_{j=1}^p |a_j|$. Note that when $\mu = 0$, the solution is the Lasso
estimator, so that it appears as a special case of the S-Lasso estimator. Now we deal
with the resolution of the S-Lasso problem (2)-(3) and its computational cost. From now
on, we suppose w.l.o.g. that $X = (x_1', \dots, x_n')'$ is standardized (that is
$n^{-1} \sum_{i=1}^n x_{i,j}^2 = 1$ and $n^{-1} \sum_{i=1}^n x_{i,j} = 0$) and
$Y = (y_1, \dots, y_n)'$ is centered (that is $n^{-1} \sum_{i=1}^n y_i = 0$). The
following lemma shows that the S-Lasso criterion can be expressed as a Lasso criterion
by augmenting the data artificially.
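Lemma 1. Given the data set $(X, Y)$ and $(\lambda, \mu)$. Define the extended dataset
$(\widetilde{X}, \widetilde{Y})$ by

\[
\widetilde{X} = \frac{1}{\sqrt{1 + \mu}} \begin{pmatrix} X \\ \sqrt{n\mu}\, J \end{pmatrix}
\qquad \text{and} \qquad
\widetilde{Y} = \begin{pmatrix} Y \\ \mathbf{0} \end{pmatrix},
\]

where $\mathbf{0}$ is a vector of size $p$ containing only zeros and $J$ is the
$p \times p$ matrix

\[
J = \begin{pmatrix}
0 & 0 & \cdots & \cdots & 0 \\
1 & -1 & \ddots & & \vdots \\
0 & 1 & -1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & 1 & -1
\end{pmatrix}. \tag{4}
\]

Let $r = \lambda / \sqrt{1 + \mu}$ and $b = \sqrt{1 + \mu}\, \beta$. Then the S-Lasso
criterion can be written

\[
\|\widetilde{Y} - \widetilde{X} b\|_n^2 + r |b|_1 .
\]

Let $\hat{b}$ be the minimizer of this Lasso criterion, then

\[
\hat{\beta}^{SL} = \frac{1}{\sqrt{1 + \mu}}\, \hat{b} .
\]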



This result is a consequence of simple algebra. Lemma 1 motivates the following
comments on the S-Lasso procedure.
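The identity stated in Lemma 1 can be checked numerically. The following is only a minimal sketch under our reading of the lemma: the helper names (`difference_matrix`, `augment`, and the two criterion functions) are illustrative and not part of the original procedure, and the data are random.

```python
import numpy as np

def difference_matrix(p):
    """p x p matrix J of Lemma 1: first row is zero, row j has 1, -1 on columns j-1, j."""
    J = np.zeros((p, p))
    for j in range(1, p):
        J[j, j - 1] = 1.0
        J[j, j] = -1.0
    return J

def augment(X, Y, mu):
    """Augmented data: stack sqrt(n*mu)*J under X, rescale, and pad Y with p zeros."""
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(n * mu) * difference_matrix(p)]) / np.sqrt(1.0 + mu)
    Y_aug = np.concatenate([Y, np.zeros(p)])
    return X_aug, Y_aug

def s_lasso_criterion(X, Y, beta, lam, mu):
    """||Y - X beta||_n^2 + lam * |beta|_1 + mu * sum_j (beta_j - beta_{j-1})^2."""
    n = X.shape[0]
    fit = np.sum((Y - X @ beta) ** 2) / n
    return fit + lam * np.sum(np.abs(beta)) + mu * np.sum(np.diff(beta) ** 2)

def augmented_lasso_criterion(X, Y, beta, lam, mu):
    """Lasso criterion on the augmented data, evaluated at b = sqrt(1 + mu) * beta."""
    n = X.shape[0]
    X_aug, Y_aug = augment(X, Y, mu)
    b = np.sqrt(1.0 + mu) * beta
    r = lam / np.sqrt(1.0 + mu)
    # The squared norm keeps the 1/n normalisation of the original sample size.
    return np.sum((Y_aug - X_aug @ b) ** 2) / n + r * np.sum(np.abs(b))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
Y = rng.standard_normal(20)
beta = rng.standard_normal(5)
gap = abs(s_lasso_criterion(X, Y, beta, 0.3, 0.7)
          - augmented_lasso_criterion(X, Y, beta, 0.3, 0.7))
```

Here `gap` measures the difference between the two criteria at an arbitrary point; it should vanish up to floating-point error for any choice of `beta`, `lam` and `mu`.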

Remark 1 (Regularization paths). The S-Lasso modification of the LARS is an
iterative algorithm. For a fixed $\mu$ (appearing in (3)), it constructs at each step
an estimator based on the correlation between the covariates and the current residual.
Each step corresponds to a value of $\lambda$. Then for a fixed $\mu$, we get the
evolution of the S-Lasso estimator coefficient values when $\lambda$ varies. This
evolution describes the regularization paths of the S-Lasso estimator, which are
piecewise linear f2l^ . This property implies that the S-Lasso problem can be solved
with the same computational cost as the ordinary least squares (OLS) estimate using the
Lasso modification of the LARS algorithm.

Remark 2 (Implementation). The number of covariates that the LARS algorithm
and its Lasso version can select is limited by the number of rows in the matrix $X$.
Applied to the augmented data $(\widetilde{X}, \widetilde{Y})$ introduced in Lemma 1,
which has $n + p$ rows, the Lasso modification of the LARS algorithm is able to select
all the $p$ covariates. Then we are no longer limited by the sample size as for the
Lasso [TUj .
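To illustrate Remarks 1 and 2, the sketch below solves the S-Lasso problem for one fixed pair $(\lambda, \mu)$ by running a Lasso solver on the augmented data of Lemma 1. Note the substitution: it uses a plain cyclic coordinate descent rather than the LARS modification discussed above, so it does not produce the whole regularization path. All function names and numerical values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, Y, r, n, n_sweeps=3000):
    """Minimise ||Y - X b||^2 / n + r * |b|_1 by cyclic coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0) / n
    resid = Y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * b[j]          # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, r / 2.0) / col_sq[j]
            resid -= X[:, j] * b[j]          # put the updated coordinate back
    return b

def s_lasso(X, Y, lam, mu):
    """S-Lasso estimate obtained from the Lasso on the augmented data of Lemma 1."""
    n, p = X.shape
    J = np.zeros((p, p))
    idx = np.arange(1, p)
    J[idx, idx - 1] = 1.0
    J[idx, idx] = -1.0
    X_aug = np.vstack([X, np.sqrt(n * mu) * J]) / np.sqrt(1.0 + mu)
    Y_aug = np.concatenate([Y, np.zeros(p)])
    b_hat = lasso_cd(X_aug, Y_aug, lam / np.sqrt(1.0 + mu), n)
    return b_hat / np.sqrt(1.0 + mu), J

# Toy example with more covariates than observations (values are arbitrary).
rng = np.random.default_rng(1)
n, p, lam, mu = 6, 10, 0.1, 0.5
X = rng.standard_normal((n, p))
Y = X[:, :3] @ np.array([2.0, 2.0, 2.0]) + 0.1 * rng.standard_normal(n)
beta_hat, J = s_lasso(X, Y, lam, mu)
```

The augmented design has $n + p$ rows, which is what allows more than $n$ coordinates of `beta_hat` to be non-zero. Whether this actually happens depends on the data and on $(\lambda, \mu)$.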
3 Theoretical properties of the S-Lasso estimator
when p < n 

In this section we present the theoretical results for the S-Lasso with a
moderate sample size (p < n). We first provide rates of convergence of the S-Lasso
estimator and show how, through a control on the regularization parameters, we can
establish root-n consistency and asymptotic normality. Then we look for variable
selection consistency. More precisely, we give conditions under which the S-Lasso
estimator succeeds in finding the set of the non-zero regression coefficients. We show
that with a suitable choice of the tuning parameters $(\lambda, \mu)$, the S-Lasso is
consistent in variable selection. All the results of this section are proved in
Appendix A.

3.1 Asymptotic Normality 

In this section, we allow the tuning parameters $(\lambda, \mu)$ to depend on the sample
size $n$. We emphasize this dependence by adding a subscript $n$ to these parameters. We
also fix the number of covariates $p$. Let us denote by $I(\cdot)$ the indicator function
and define the sign function such that for any $x \in \mathbb{R}$, $\mathrm{Sgn}(x)$
equals 1, $-1$ or 0 respectively when $x$ is greater than, smaller than or equal to 0.
Knight and Fu [Hj gave the asymptotic distribution of the Lasso estimator. We provide
here the asymptotic distribution of the S-Lasso. Let $C_n = n^{-1} X'X$ be the Gram
matrix, then

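Theorem 1. Given the data set $(X, Y)$, assume the correlation matrix verifies

\[
C_n \to C, \quad \text{when } n \to \infty,
\]

in probability where $C$ is a positive definite matrix. If there exists a sequence $v_n$
such that $v_n \to 0$ and the regularization parameters verify
$\lambda_n v_n^{-1} \to \lambda \geq 0$ and $\mu_n v_n^{-1} \to \mu \geq 0$. Then, if
$(\sqrt{n}\, v_n)^{-1} \to \kappa \geq 0$, we have

\[
v_n^{-1} (\hat{\beta}^{SL} - \beta^*) \longrightarrow \operatorname*{Argmin}_{u \in \mathbb{R}^p} V(u), \quad \text{when } n \to \infty,
\]

where

\[
V(u) = -2\kappa\, u^T W + u^T C u
+ \lambda \sum_{j=1}^{p} \left\{ u_j\, \mathrm{Sgn}(\beta_j^*)\, I(\beta_j^* \neq 0) + |u_j|\, I(\beta_j^* = 0) \right\}
+ 2\mu \sum_{j=2}^{p} \left\{ (u_j - u_{j-1}) (\beta_j^* - \beta_{j-1}^*) \right\},
\]

with $W \sim \mathcal{N}(0, \sigma^2 C)$.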
Remark 3. When $\kappa \neq 0$ is a finite constant: in this case $v_n^{-1}$ is
$O(\sqrt{n})$ so that the estimator $\hat{\beta}^{SL}$ is root-n consistent. Moreover
when $\lambda = \mu = 0$, we obtain the following standard regression asymptotic
normality: $\sqrt{n} (\hat{\beta}^{SL} - \beta^*) \to \mathcal{N}(0, \sigma^2 C^{-1})$.
When $\kappa = 0$: in this case, the rate of convergence is slower than $\sqrt{n}$ so
that we no longer have the optimal rate. Moreover the limit is not random anymore.

Note first that the correlation penalty does not alter the asymptotic bias when
successive regression coefficients are equal. We also remark that the sequence $v_n$
must be chosen properly as it determines our convergence rate. We would like $v_n$ to
be as close as possible to $1/\sqrt{n}$. This sequence is calibrated by the user such
that $\lambda_n / v_n \to \lambda$ and $\mu_n / v_n \to \mu$.

3.2 Consistency in variable selection 

In this section, variable selection consistency of the S-Lasso estimator is considered.
For this purpose, we introduce the following sparsity sets:
$\mathcal{A}^* = \{ j : \beta_j^* \neq 0 \}$ and
$\hat{\mathcal{A}}_n = \{ j : \hat{\beta}_j^{SL} \neq 0 \}$. The set $\mathcal{A}^*$
consists of the indexes of the non-zero coefficients in the oracle regression vector
$\beta^*$. The set $\hat{\mathcal{A}}_n$ consists of the indexes of the non-zero
coefficients in the S-Lasso estimator $\hat{\beta}^{SL}$ and is also called the active
set of this estimator. Before stating our result, let us introduce some notation. For
any vector $a \in \mathbb{R}^p$ and any set of indexes $B \subset \{1, \dots, p\}$,
denote by $a_B$ the restriction of the vector $a$ to the indexes in $B$. In the same
way, if we denote by $|B|$ the cardinality of the set $B$, then for any $s \times q$
matrix $M$, we use the following convention: i) $M_{B,B}$ is the $|B| \times |B|$ matrix
consisting of the rows and columns of $M$ whose indexes are in $B$; ii) $M_{\cdot,B}$ is
the $s \times |B|$ matrix consisting of the columns of $M$ whose indexes are in $B$;
iii) $M_{B,\cdot}$ is the $|B| \times q$ matrix consisting of the rows of $M$ whose
indexes are in $B$. Moreover, we define $\widetilde{J}$ as the $p \times p$ matrix
$J'J$ where $J$ was defined in (4). Finally we define for $j \in \{1, \dots, p\}$ the
quantity $\Omega_j = \Omega_j(\lambda, \mu, \mathcal{A}^*, \beta^*)$ by

\[
\Omega_j = C_{j,\mathcal{A}^*} \left( C_{\mathcal{A}^*,\mathcal{A}^*} + \mu \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \right)^{-1}
\left( 2^{-1}\, \mathrm{Sgn}(\beta^*_{\mathcal{A}^*}) + \frac{\mu}{\lambda} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \beta^*_{\mathcal{A}^*} \right)
- \frac{\mu}{\lambda} \widetilde{J}_{j,\mathcal{A}^*} \beta^*_{\mathcal{A}^*}, \tag{5}
\]

where $C$ is defined as in Theorem 1. Now consider the following conditions: for every
$j \in (\mathcal{A}^*)^c$

\[
|\Omega_j(\lambda, \mu, \mathcal{A}^*, \beta^*)| < 1, \tag{6}
\]

\[
|\Omega_j(\lambda, \mu, \mathcal{A}^*, \beta^*)| \leq 1. \tag{7}
\]

These conditions on the correlation matrix $C$ and the regression vector
$\beta^*_{\mathcal{A}^*}$ are the analogues respectively of the sufficient and necessary
conditions derived for the Lasso ([35], [34] and [32]). Now we state the consistency
results.
Theorem 2. If condition (6) holds, then for every couple of regularization parameters
$(\lambda_n, \mu_n)$ such that $\lambda_n \to 0$, $\lambda_n n^{1/2} \to \infty$ and
$\mu_n \to 0$, the S-Lasso estimator $\hat{\beta}^{SL}$ as defined in (2)-(3) is
consistent in variable selection. That is

\[
\mathbb{P}(\hat{\mathcal{A}}_n = \mathcal{A}^*) \to 1, \quad \text{when } n \to \infty.
\]

Theorem 3. If there exist sequences $(\lambda_n, \mu_n)$ such that $\hat{\beta}^{SL}$
converges to $\beta^*$ and $\hat{\mathcal{A}}_n$ converges to $\mathcal{A}^*$ in
probability, then condition (7) is satisfied.

We have just established necessary and sufficient conditions for the selection
consistency of the S-Lasso estimator. Due to the assumptions needed in Theorem 2
(more precisely $\lambda_n n^{1/2} \to \infty$), root-n consistency and variable
selection consistency cannot be treated here simultaneously. We may want to know
whether the S-Lasso estimator can be consistent with a slower rate than $n^{1/2}$ and
consistent in variable selection at the same time.

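Remark 4. Here are special cases of conditions (6)-(7).

When $\mu = 0$ and $\mu/\lambda = 0$: these conditions are exactly the sufficient and
necessary conditions of the Lasso estimator. In this case Yuan and Lin showed that the
condition (6) becomes necessary and sufficient for the Lasso estimator consistency
in variable selection.

When $\mu = 0$ and $\mu/\lambda = \gamma \neq 0$: in this case, condition (6) becomes

\[
\sup_{j \in (\mathcal{A}^*)^c} \left| C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*}
\left( 2^{-1}\, \mathrm{Sgn}(\beta^*_{\mathcal{A}^*}) + \gamma \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \beta^*_{\mathcal{A}^*} \right)
- \gamma \widetilde{J}_{j,\mathcal{A}^*} \beta^*_{\mathcal{A}^*} \right| < 1.
\]

Here a good calibration of $\gamma$ leads to consistency in variable selection:

• if $(C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} - \widetilde{J}_{j,\mathcal{A}^*}) \beta^*_{\mathcal{A}^*} > 0$, then $\gamma$ must be chosen between

\[
- \frac{1 + 2^{-1} C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*} \mathrm{Sgn}(\beta^*_{\mathcal{A}^*})}
{(C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} - \widetilde{J}_{j,\mathcal{A}^*}) \beta^*_{\mathcal{A}^*}}
\quad \text{and} \quad
\frac{1 - 2^{-1} C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*} \mathrm{Sgn}(\beta^*_{\mathcal{A}^*})}
{(C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} - \widetilde{J}_{j,\mathcal{A}^*}) \beta^*_{\mathcal{A}^*}} .
\]

• if $(C_{j,\mathcal{A}^*} C^{-1}_{\mathcal{A}^*,\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} - \widetilde{J}_{j,\mathcal{A}^*}) \beta^*_{\mathcal{A}^*} < 0$, then $\gamma$ must be chosen between the same
quantities but with their order inverted.

When $\mu \neq 0$ and $\mu/\lambda = \gamma \neq 0$: this case is similar to the
previous one. In addition, it allows another control on the condition through a
calibration with $\mu$, so that condition (6) can be satisfied with a better control.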

We conclude that if we sacrifice the optimal rate of convergence (i.e. root-n
consistency), we are able, through a proper choice of the tuning parameters
$(\lambda_n, \mu_n)$, to get consistency in variable selection. Note that Zou [35]
showed that the Lasso estimator cannot be consistent in variable selection even with a
slower rate of convergence than $\sqrt{n}$. He then added weights to the Lasso (i.e.
the adaptive Lasso estimator) in order to get the Oracle Properties (that is both
asymptotic normality and variable selection consistency). Note that we can easily adapt
the techniques used in the adaptive Lasso to provide a weighted S-Lasso estimator which
achieves the Oracle Properties.

4 Theoretical results when dimension p is larger than 
sample size n 

In this section, we study the performance of the S-Lasso estimator in the high
dimensional case. In particular, we provide a non-asymptotic bound on the squared risk.
We also provide a bound on the estimation risk under the sup-norm (i.e., the
$\ell_\infty$-norm: $\|\hat{\beta}^{SL} - \beta^*\|_\infty = \sup_j |\hat{\beta}_j^{SL} - \beta_j^*|$).
This last result helps us to provide a variable selection consistent estimator obtained
through thresholding the S-Lasso estimator. The results of this section are proved in
Appendix B.



4.1 Sparsity Inequality 

Now we establish a Sparsity Inequality (SI) achieved by the S-Lasso estimator, that
is a bound on the squared risk that takes into account the sparsity of the oracle
regression vector $\beta^*$. More precisely, we prove that the rate of convergence is
$|\mathcal{A}^*| \log(p)/n$. For this purpose, we need some assumptions on the Gram
matrix $C_n$, which is normalized in our setting. Recall that
$\xi_j = (x_{1,j}, \dots, x_{n,j})'$. Then we define the regularization parameters
$\lambda_n$ and $\mu_n$ in the following forms:



/log(p) 2 V io &(P) { Q\ 

X n = K 1 a\ , and fJ, n = K 2 cr , (8) 

V n n 

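where $\kappa_1 > 2\sqrt{2}$ and $\kappa_2$ are positive constants. Let us define the
maximal correlation quantity
$\rho_1 = \max_{j \in \mathcal{A}^*} \max_{k \in \{1, \dots, p\},\, k \neq j} |(C_n)_{j,k}|$.
Using this notation, we formulate the following assumptions:

• Assumption (A1). The true regression vector $\beta^*$ is such that there exists a
finite constant $L_1$ such that:

\[
\beta^{*\prime}_{\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \beta^*_{\mathcal{A}^*} \leq L_1 \log(p)\, |\mathcal{A}^*|, \tag{9}
\]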
where $\widetilde{J} = J'J$ and $J$ was defined in (4).

• Assumption (A2). We have:

\[
\rho_1 \leq \frac{1}{16\, |\mathcal{A}^*|}. \tag{10}
\]

Note that Assumption (A1) is not restrictive. A sufficient condition is that the
largest non-zero component of $\beta^*_{\mathcal{A}^*}$ is bounded by $L_1 \log(p)$,
which can be very large. Assumption (A2) is the well-known coherence condition
considered in [3], which was introduced in [7]. Most of the SIs provided in the
literature use such a condition. We refer to [3] for more details.

Theorem 4 below provides an upper bound for the squared error of the estimator
$\hat{\beta}^{SL}$ and for its $\ell_1$ estimation error which takes into account the
sparsity index $|\mathcal{A}^*|$.

Theorem 4. Let us consider the linear regression model (1). Let $\hat{\beta}^{SL}$ be
the S-Lasso estimator. Let $\mathcal{A}^*$ be the sparsity set. Suppose that $p > n$
(and even $p \gg n$). If Assumptions (A1)-(A2) hold, then with probability greater than
$1 - u_{n,p}$, we have

\[
\|X\hat{\beta}^{SL} - X\beta^*\|_n^2 \leq c_2\, \frac{\log(p)\, |\mathcal{A}^*|}{n}, \tag{11}
\]

and

\[
|\hat{\beta}^{SL} - \beta^*|_1 \leq c_1 \sqrt{\frac{\log(p)}{n}}\, |\mathcal{A}^*|, \tag{12}
\]

where $c_2 = (16\kappa_1^2 + L_1 \kappa_2)\sigma^2$,
$c_1 = (16\kappa_1 + L_1 \kappa_1^{-1} \kappa_2)\sigma$ and where
$u_{n,p} = p^{1 - \kappa_1^2/8}$ with $\kappa_1$ and $\kappa_2$ the constants appearing
in (8).

The proof of Theorem 4 is based on the 'argmin' definition of the estimator
$\hat{\beta}^{SL}$ and some technical concentration inequalities. Similar bounds were
provided for the Lasso estimator in [4]. Let us mention that the constants $c_1$ and
$c_2$ are not optimal. We focused our attention on the dependency on $n$ (and then on
$p$ and $|\mathcal{A}^*|$). It
turns out that our results are near optimal. For instance, for the li risk, the S-Lasso 
estimator reaches nearly the optimal rate log(r^rr + 1) up to a logarithmic factor 
[31 Theorem 5.1]. 



4.2 Sup-norm bound and variable selection

Now we provide a bound on the sup-norm $\|\beta^* - \hat{\beta}^{SL}\|_\infty$. Thanks
to this result, one may be able to define a rule in order to get a variable selection
consistent estimator when $p \gg n$. That is, we can construct an estimator which
succeeds in recovering the support of $\beta^*$ in high dimensional settings.

Small modifications have to be made to provide our selection results in this section.
Let $K_n$ be the symmetric $p \times p$ matrix defined by
$K_n = C_n + \mu_n \widetilde{J}$. Instead of Assumption (A2), we will consider the
following

• Assumption (A3). We assume that

\[
\max_{j,k \in \{1, \dots, p\},\, k \neq j} |(K_n)_{j,k}| \leq \frac{1}{16\, |\mathcal{A}^*|} .
\]



Remark 5. Note that the matrix $\widetilde{J}$ is tridiagonal with its off-diagonal
terms equal to $-1$. If we do not consider the diagonal terms, we remark that $C_n$ and
$K_n$ differ only in the terms on the second diagonals (i.e.,
$(K_n)_{j-1,j} \neq (C_n)_{j-1,j}$ for $j = 2, \dots, p$ as soon as $\mu_n \neq 0$).
Then, as we do not consider the diagonal terms in Assumptions (A2) and (A3), they
differ only in the restriction they impose on the terms on the second diagonals. Terms
in the second diagonals of $C_n$ correspond to correlations between successive
covariates. Then when high correlations exist between successive covariates, a suitable
choice of $\mu_n$ makes Assumption (A3) satisfied while Assumption (A2) is not. Hence,
Assumption (A3) fits better with the setup considered in the paper.

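In the sequel, a convenient choice of the tuning parameter $\mu_n$ is
$\mu_n = \kappa_3 \sigma / \sqrt{n \log(p)}$, where $\kappa_3 > 0$ is a constant.
Moreover, from Assumption (A1), we have
$\beta^{*\prime}_{\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \beta^*_{\mathcal{A}^*} \leq L_1 \log(p)\, |\mathcal{A}^*|$.
This inequality guarantees the existence of a constant $L_2 > 0$ such that
$\|\widetilde{J}\beta^*\|_\infty \leq L_2 \log(p)$.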

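Theorem 5. Let us consider the linear regression model (1). Let
$\lambda_n = \kappa_1 \sigma \sqrt{\log(p)/n}$ and
$\mu_n = \kappa_3 \sigma / \sqrt{n \log(p)}$ with $\kappa_1 > 2\sqrt{2}$ and
$\kappa_3 > 0$. Suppose that $p > n$ (and even $p \gg n$). Under Assumptions (A1) and
(A3) and with probability greater than $1 - p^{1 - \kappa_1^2/8}$, we have

\[
\|\hat{\beta}^{SL} - \beta^*\|_\infty \leq c \sqrt{\frac{\log(p)}{n}},
\]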

where c equals to 

1 (3 1 4LiB 2LiB / 2Lj_B 8L1 L 2 B 2 4L 2 B L 2 B \ 

[i + + 9^42 + ^42" + y 3a (Q, _ i)A2 + g a ( a - l)A 4 " + ( H2~ + ~> n ) ' 
Note that the leading term in c is f + ^ + ^§ + f^f + 3a ^-?)A^ One may
recover the result obtained for the Lasso by setting $L_1$ to zero [H]. Moreover, the
calibration of $\mu_n$ aims at making the convergence rate under the sup-norm equal to
$\sqrt{\log(p)/n}$. On one hand, the proof of Theorem 5 allows us to choose this
parameter with a faster convergence to zero without affecting the rate of convergence.
On the other hand, a more restrictive Assumption (A1) on
$\beta^{*\prime}_{\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \beta^*_{\mathcal{A}^*}$
and $\|\widetilde{J}\beta^*\|_\infty$ can be formulated in order to make $\mu_n$
converge more slowly to zero. If we let
$\beta^{*\prime}_{\mathcal{A}^*} \widetilde{J}_{\mathcal{A}^*,\mathcal{A}^*} \beta^*_{\mathcal{A}^*} \leq L_1 |\mathcal{A}^*|$
in Assumption (A1), we can set $\mu_n$ as $O(\sqrt{\log(p)/n})$, the slowest
convergence we can get for $\mu_n$.

Let us now provide a consistent version of the S-Lasso estimator. Consider
$\hat{\beta}^{ThSL}$, the thresholded S-Lasso estimator defined componentwise by
$\hat{\beta}_j^{ThSL} = \hat{\beta}_j^{SL}\, I(|\hat{\beta}_j^{SL}| \geq c \sqrt{\log(p)/n})$
where $c$ is given in Theorem 5. This estimator consists of the S-Lasso estimator
with its small coefficients set to zero. We thereby enforce the selection property of
the S-Lasso estimator. Variable selection consistency of this estimator is established
under one more restriction:

• Assumption (A4). The smallest non-zero coefficient of $\beta^*$ is such that there
exists a constant $c_l > 0$ with

\[
\min_{j \in \mathcal{A}^*} |\beta_j^*| > c_l \sqrt{\frac{\log(p)}{n}} .
\]

Assumption (A4) bounds from below the smallest non-zero regression coefficient in
$\beta^*$. This is a common assumption to provide sign consistency in the high
dimensional case. This condition appears in pj2 EH [33], [34] but with a larger (in
terms of the dependence on the sample size $n$) and therefore more restrictive
threshold. We refer to [16] for a longer discussion. An equivalent lower bound on the
oracle regression coefficients can be found in [21 US]. With this new assumption, we
can state the following sign consistency result.

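Theorem 6. Let us consider the thresholded S-Lasso estimator $\hat{\beta}^{ThSL}$ as
described above. Choose moreover $\lambda_n = \kappa_1 \sigma \sqrt{\log(p)/n}$ and
$\mu_n = \kappa_3 \sigma / \sqrt{n \log(p)}$ with the positive constants
$\kappa_1 > 2\sqrt{2}$ and $\kappa_3$. Under Assumptions (A1), (A3) and (A4), if
$c_l > 2c$ where $c$ is given by Theorem 5, then with probability greater than
$1 - p^{1 - \kappa_1^2/8}$, we have

\[
\mathrm{Sgn}(\hat{\beta}^{ThSL}) = \mathrm{Sgn}(\beta^*), \tag{13}
\]

and then as $n \to +\infty$

\[
\mathbb{P}\left( \mathrm{Sgn}(\hat{\beta}^{ThSL}) = \mathrm{Sgn}(\beta^*) \right) \to 1. \tag{14}
\]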
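The thresholding rule that defines the thresholded S-Lasso estimator is simple to apply once an S-Lasso estimate is available. The sketch below is only an illustration: the function name `threshold_s_lasso` is ours, and the constant `c` is treated as a user-supplied value, since the constant of Theorem 5 depends on quantities that are unknown in practice.

```python
import numpy as np

def threshold_s_lasso(beta_sl, c, n):
    """Set to zero every coordinate whose magnitude is below c * sqrt(log(p) / n)."""
    beta_sl = np.asarray(beta_sl, dtype=float)
    p = beta_sl.shape[0]
    level = c * np.sqrt(np.log(p) / n)
    return np.where(np.abs(beta_sl) >= level, beta_sl, 0.0)

# Toy input: p = 4, n = 100 and c = 1 give a threshold of about 0.118.
beta_sl = np.array([1.5, 0.02, -0.8, -0.01])
beta_th = threshold_s_lasso(beta_sl, c=1.0, n=100)
```

In this toy example the two small coefficients are removed while the signs of the two large ones are preserved, which is the behaviour that (13) describes under Assumptions (A1), (A3) and (A4).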
Remark 6. As observed in Remark 5, Assumption (A3) is more easily satisfied
when correlation exists between successive covariates. Then in situations where the
correlation matrix $C_n$ is tridiagonal with its off-diagonal terms equal to $\delta$
with $\delta \in [0, 1]$, the constant $\kappa_3$ appearing in the definition of
$\mu_n$ can be adjusted in order to get Assumption (A3) satisfied.

5 Model Selection 

As already said (Remark 1 in Section 2), each step of the S-Lasso version of the
LARS algorithm provides an estimator of $\beta^*$. In this section, we are interested in
the choice of the best estimator according to its prediction accuracy. For a new
$n \times p$ matrix $x_{new}$ of instances (independent of $X$), denote by
$\hat{y}^{SL} = x_{new} \hat{\beta}^{SL}$ the estimator of its unknown response value
$y_{new}$ and let $m = \mathbb{E}(y_{new} | x_{new})$. We aim to minimize the true risk
$\mathbb{E}\{ \|m - \hat{y}^{SL}\|_n^2 \}$. First, we easily obtain

\[
\mathbb{E}\left\{ \|m - \hat{y}^{SL}\|_n^2 \right\}
= \mathbb{E}\left\{ \|Y - \hat{y}^{SL}\|_n^2 - \sigma^2 + 2 n^{-1} \sum_{i=1}^{n} \mathrm{Cov}(y_i, \hat{y}_i^{SL}) \right\},
\]

where the expectation is taken over the random variable $Y$. The last term in this
equation was called optimism [9]. Moreover, Tibshirani [25] links this quantity to the
degree of freedom $df(\hat{y}^{SL})$ of the estimator $\hat{y}^{SL}$, so that the above
equality becomes

\[
\mathbb{E}\left\{ \|m - \hat{y}^{SL}\|_n^2 \right\}
= \mathbb{E}\left\{ \|Y - \hat{y}^{SL}\|_n^2 - \sigma^2 + 2 n^{-1}\, df(\hat{y}^{SL})\, \sigma^2 \right\}. \tag{15}
\]

This final expression involves the degree of freedom, which is unknown. Various methods
exist to estimate the degree of freedom, such as bootstrap [TT] or data perturbation
methods [24]. We give an explicit form for the degree of freedom in order to reduce
the computational cost, as in [10] and [37].

Degrees of freedom: the degree of freedom is a quantity of interest in model
selection. Before stating our result, let us introduce some useful properties of
the regularization paths of the S-Lasso estimator:

Given a response $Y$ and a regularization parameter $\mu \geq 0$, there is a finite
sequence $0 = \lambda^{(K)} < \lambda^{(K-1)} < \dots < \lambda^{(0)}$ such that
$\hat{\beta}^{SL} = 0$ for every $\lambda \geq \lambda^{(0)}$. In this notation,
superscripts correspond to the steps of the S-Lasso version of the LARS algorithm.

Given a response $Y$ and a regularization parameter $\mu \geq 0$, for
$\lambda \in (\lambda^{(k+1)}, \lambda^{(k)})$, the same covariates are used to
construct the estimator. Let us denote by $\mathcal{A}_\zeta$ the active set for a
fixed couple $\zeta = (\lambda, \mu)$ and by $X_{\mathcal{A}_\zeta}$ the corresponding
design matrix.
In what follows, we will use the subscript $\zeta$ to emphasize the fact that the
considered quantity depends on $\zeta$.



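Theorem 7. For fixed $\mu \geq 0$ and $\lambda > 0$, an unbiased estimate of the
effective degree of freedom of the S-Lasso estimate is given by

\[
\widehat{df}(\hat{y}^{SL}_\zeta) = \mathrm{Tr}\left[ X_{\mathcal{A}_\zeta}
\left( X'_{\mathcal{A}_\zeta} X_{\mathcal{A}_\zeta} + n\mu\, \widetilde{J}_{\mathcal{A}_\zeta,\mathcal{A}_\zeta} \right)^{-1}
X'_{\mathcal{A}_\zeta} \right],
\]

where $\widetilde{J} = J'J$ is defined by

\[
\widetilde{J} = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1 & 2 & -1 \\
0 & \cdots & 0 & -1 & 1
\end{pmatrix}. \tag{16}
\]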



As the estimate given in Theorem 7 has a high computational cost, we
propose the following estimator of the degree of freedom of the S-Lasso estimator:

\[
\widehat{df}(\hat{y}^{SL}_\zeta) = \frac{|\mathcal{A}_\zeta| - 2}{1 + 2\mu} + \frac{2}{1 + \mu}, \tag{17}
\]

which is very easy to compute. Let $I_s$ be the $s \times s$ identity matrix where $s$
is an integer. We obtained the above approximation of the degree of freedom under the
orthogonal covariance matrix assumption (that is $n^{-1} X'X = I_p$). Moreover we
approximate the matrix
$(I_{|\mathcal{A}_\zeta|} + \mu \widetilde{J}_{\mathcal{A}_\zeta,\mathcal{A}_\zeta})$
by the diagonal matrix with $1 + \mu$ in the first and the last terms, and $1 + 2\mu$
in the others.

Remark 7 (Comparison to the Lasso and the Elastic-Net). A similar argument leads
to an estimate of the degree of freedom of the Lasso,
$\widehat{df}(\hat{y}^{L}_\zeta) = |\mathcal{A}_\zeta|$, and to an estimate of the
degree of freedom of the Elastic-Net estimator,
$\widehat{df}(\hat{y}^{EN}_\zeta) = |\mathcal{A}_\zeta| / (1 + \mu)$. These
approximations of the degrees of freedom provide the following comparison for a fixed
$\zeta$: $\widehat{df}(\hat{y}^{SL}_\zeta) \leq \widehat{df}(\hat{y}^{EN}_\zeta) \leq \widehat{df}(\hat{y}^{L}_\zeta)$.
A conclusion is that, for a given active set, the S-Lasso estimator is the one whose
model size is penalized the least, and the Lasso estimator the one whose model size is
penalized the most. As a consequence, the S-Lasso estimator should select larger models
than the Lasso or the Elastic-Net estimator.
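The three degree-of-freedom approximations compared in Remark 7 can be evaluated side by side. This is a minimal sketch with illustrative function names. The treatment of active sets with fewer than three elements in `df_s_lasso` is our own assumption, since (17) implicitly supposes that the active set has distinct first, last and interior elements.

```python
def df_lasso(active_size):
    """Approximation for the Lasso: the size of the active set."""
    return float(active_size)

def df_elastic_net(active_size, mu):
    """Approximation for the Elastic-Net: |A| / (1 + mu)."""
    return active_size / (1.0 + mu)

def df_s_lasso(active_size, mu):
    """Approximation (17): (|A| - 2) / (1 + 2 mu) + 2 / (1 + mu).

    Assumption of this sketch: for |A| <= 2 every active coordinate is
    treated as a boundary term, i.e. weighted by 1 / (1 + mu).
    """
    if active_size <= 2:
        return active_size / (1.0 + mu)
    return (active_size - 2) / (1.0 + 2.0 * mu) + 2.0 / (1.0 + mu)

# Example with |A| = 10 and mu = 0.5.
values = (df_s_lasso(10, 0.5), df_elastic_net(10, 0.5), df_lasso(10))
```

For $\mu = 0$ the three quantities coincide with the size of the active set, and for $\mu > 0$ they are ordered as in Remark 7.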
6 The Normalized S-Lasso estimator



In this section, we look for a scaled S-Lasso estimator which would have better
empirical performance than the original S-Lasso presented above. The idea behind this
study is to better control shrinkage. Indeed, using the S-Lasso procedure (2)-(3)
induces double shrinkage: one due to the Lasso penalty and the other due to the fusion
penalty. We want to undo the shrinkage implied by the fusion penalty, as shrinkage is
already ensured by the Lasso penalty. We then suggest studying the S-Lasso criterion
(2)-(3) without the Lasso penalty (i.e. with only the $\ell_2$-fusion penalty) in order
to find the constant we have to scale with.



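Define

\[
\tilde{\beta} = \operatorname*{Argmin}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_n^2 + \mu \sum_{j=2}^{p} (\beta_j - \beta_{j-1})^2 .
\]

We easily obtain
$\tilde{\beta} = ((X'X)/n + \mu \widetilde{J})^{-1} (X'Y)/n := L^{-1} (X'Y)/n$ where
$\widetilde{J}$ is given by (16). Moreover as the design matrix $X$ is standardized,
the symmetric matrix $L$ can be written (only the upper triangle is displayed)

\[
L = \begin{pmatrix}
1 + \mu & \frac{\xi_1'\xi_2}{n} - \mu & \frac{\xi_1'\xi_3}{n} & \cdots & \frac{\xi_1'\xi_p}{n} \\
 & 1 + 2\mu & \frac{\xi_2'\xi_3}{n} - \mu & \ddots & \vdots \\
 & & \ddots & \ddots & \frac{\xi_{p-2}'\xi_p}{n} \\
 & & & 1 + 2\mu & \frac{\xi_{p-1}'\xi_p}{n} - \mu \\
 & & & & 1 + \mu
\end{pmatrix} .
\]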



In order to get rid of the shrinkage due to the fusion penalty, we force $L$ to have ones (or values close to one) on its diagonal. To this end, we scale the estimator $\tilde\beta$ by a factor $c$. Here are the two choices we will use in the remainder of the paper: i) the first is $c = 1 + \mu$, so that the first and the last diagonal elements of $L^{-1}$ become equal to one; ii) the second is $c = 1 + 2\mu$, which offers the advantage that all the diagonal elements of $L^{-1}$ become equal to one except the first and the last. This second choice seems more appropriate to undo this extra shrinkage, especially in high dimensional problems.
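The effect of the two scaling constants on the diagonal can be checked numerically. The following is a minimal sketch, assuming that the fusion matrix is $J = D'D$ with $D$ the $(p-1) \times p$ first-difference matrix and that "standardized" means that each column of $X$ has squared empirical norm one; the helper names `fusion_matrix` and `penalized_gram` are hypothetical and introduced only for illustration.

```python
import numpy as np

def fusion_matrix(p):
    """J = D'D, with D the (p-1) x p first-difference matrix,
    so that b'Jb = sum_j (b_j - b_{j-1})^2 (assumed form of J)."""
    D = np.diff(np.eye(p), axis=0)
    return D.T @ D

def penalized_gram(X, mu):
    """L = X'X/n + mu * J for a design X with n rows."""
    n, p = X.shape
    return X.T @ X / n + mu * fusion_matrix(p)

rng = np.random.default_rng(0)
n, p, mu = 50, 6, 0.7
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))   # each column now has squared n-norm 1

L = penalized_gram(X, mu)
diag = np.diag(L)
print(np.round(diag / (1 + mu), 3))       # scaling by c = 1 + mu
print(np.round(diag / (1 + 2 * mu), 3))   # scaling by c = 1 + 2*mu
```

With $c = 1 + \mu$ only the two border entries of the rescaled diagonal are equal to one, while with $c = 1 + 2\mu$ all the interior entries are, which is why the second choice matters more as $p$ grows.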



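We first give a generalization of Lemma 1.

Lemma 2. Given the dataset $(X, Y)$ and $(\lambda, \mu)$. Define the augmented dataset $(\tilde X, \tilde Y)$ by

$$\tilde X = \frac{1}{\nu_1}\begin{pmatrix} X \\ \sqrt{n\mu}\,\tilde J \end{pmatrix} \quad \text{and} \quad \tilde Y = \begin{pmatrix} Y \\ \mathbf{0} \end{pmatrix},$$

where $\nu_1$ is a constant which depends only on $\mu$, and $\tilde J$ is given by (4). Let $r = \lambda/\nu_1$ and $b = (\nu_2/c)\beta$, where $\nu_2$ is a constant which depends only on $\mu$, and $c$ is the scaling constant which appears in the previous study. Then the S-Lasso criterion can be written

$$\|\tilde Y - \tilde X b\|_n^2 + r|b|_1. \qquad (18)$$

Let $\hat b$ be the minimizer of this Lasso criterion; then we define the Scaled Smooth Lasso (SS-Lasso) by

$$\hat\beta^{SSL} = \hat\beta^{SSL}(\nu_1, \nu_2, c) = (c/\nu_2)\,\hat b.$$

Moreover, let $J = \tilde J'\tilde J$. Then we have

$$\hat\beta^{SSL} = \operatorname{Argmin}_{\beta} \left\{ \frac{\nu_2}{\nu_1 c}\,\beta'\left(\frac{X'X}{n} + \mu J\right)\beta - 2\,\frac{Y'X}{n}\,\beta + \lambda \sum_{j=1}^{p} |\beta_j| \right\}. \qquad (19)$$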



Equation (19) is only a rearrangement of the Lasso criterion (18). The SS-Lasso expression (19) emphasizes the importance of the scaling constant $c$. In a way, the SS-Lasso estimator stabilizes the Lasso estimator $\hat\beta^{L}$ (criterion (18) based on $(X, Y)$ instead of $(\tilde X, \tilde Y)$), as we have

$$\hat\beta^{L} = \operatorname{Argmin}_{\beta} \left\{ \beta'\left(\frac{X'X}{n}\right)\beta - 2\,\frac{Y'X}{n}\,\beta + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$

The choice of $\nu_1$ and $\nu_2$ should be linked to this scaling constant $c$ in order to get better empirical performance and to have fewer parameters to calibrate. Let us define some specific cases. i) Case 1: when $\nu_1 = \nu_2 = \sqrt{1 + \mu}$ and $c = 1$, this is the "original" S-Lasso estimator as seen in Section 2. ii) Case 2: when $\nu_1 = \nu_2 = \sqrt{1 + \mu}$ and $c = 1 + \mu$, we call this scaled S-Lasso estimator the Normalized Smooth Lasso (NS-Lasso) and we denote it by $\hat\beta^{NSL}$. In this case, we have $\hat\beta^{NSL} = (1 + \mu)\hat\beta^{SL}$. iii) Case 3: when $\nu_1 = \nu_2 = \sqrt{1 + 2\mu}$ and $c = 1 + 2\mu$, we call this scaled version the Highly Normalized Smooth Lasso (HS-Lasso) and we denote it by $\hat\beta^{HSL}$.

Other choices are possible for $\nu_1$ and $\nu_2$ in order to better control shrinkage. For instance, we can consider a compromise between the NS-Lasso and the HS-Lasso by defining $\nu_1 = 1 + \mu$ and $\nu_2 = 1 + 2\mu$.
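The augmented-data construction of Lemma 2 can be sketched in code. This is an illustrative sketch only, under stated assumptions: the matrix $\tilde J$ is taken to be the $(p-1) \times p$ first-difference matrix, the Lasso criterion (18) is minimized by a plain coordinate descent (not the LARS-type algorithm used in the experiments), and the function names `lasso_cd` and `ss_lasso` are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, r, n, n_sweeps=500):
    """Coordinate descent for (1/n) * ||y - X b||^2 + r * |b|_1."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * b[j]                  # drop coordinate j from the fit
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - r / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * b[j]                  # put the updated coordinate back
    return b

def ss_lasso(X, y, lam, mu, nu1, nu2, c):
    """Scaled Smooth Lasso through the augmented-data Lasso criterion (18)."""
    n, p = X.shape
    J_tilde = np.diff(np.eye(p), axis=0)             # assumed (p-1) x p difference matrix
    X_aug = np.vstack([X, np.sqrt(n * mu) * J_tilde]) / nu1
    y_aug = np.concatenate([y, np.zeros(p - 1)])
    b_hat = lasso_cd(X_aug, y_aug, lam / nu1, n)
    return (c / nu2) * b_hat

rng = np.random.default_rng(1)
n, p, lam, mu = 40, 8, 0.5, 0.3
X = rng.standard_normal((n, p))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0.0]) + rng.standard_normal(n)

nu = np.sqrt(1 + mu)
beta_sl = ss_lasso(X, y, lam, mu, nu, nu, c=1.0)        # Case 1: original S-Lasso
beta_nsl = ss_lasso(X, y, lam, mu, nu, nu, c=1 + mu)    # Case 2: NS-Lasso
print(np.allclose(beta_nsl, (1 + mu) * beta_sl))        # checks the Case 2 identity
```

Since Cases 1 and 2 share the same $\nu_1$, they share the same augmented Lasso problem, and only the final rescaling by $c/\nu_2$ differs; this is what the last line checks.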

Remark 8 (Connection with Soft Thresholding). Let us consider the limit case of the NS-Lasso estimator. Denote $\hat\beta^{NSL}_{\infty} = \lim_{\mu \to \infty} \hat\beta^{NSL}$; then, using (19), we have

$$\hat\beta^{NSL}_{\infty} = \operatorname{Argmin}_{\beta} \left\{ \beta'\beta - 2Y'X\beta + \lambda|\beta|_1 \right\}.$$

As a consequence, $(\hat\beta^{NSL}_{\infty})_j = \left(|Y'\xi_j| - \frac{\lambda}{2}\right)_+ \mathrm{Sgn}(Y'\xi_j)$, which is the Univariate Soft Thresholding [8]. Hence, when $\mu \to \infty$, the NS-Lasso works as if all the covariates were independent. The Lasso, which corresponds to the NS-Lasso when $\mu = 0$, often fails to select covariates when high correlations exist between relevant and irrelevant covariates. It seems that the NS-Lasso is able to avoid such a problem by increasing $\mu$ and working as if all the covariates were independent. Then, for a fixed $\lambda$, the control of the regularization parameter $\mu$ appears to be crucial. When we vary it, the NS-Lasso bridges the Lasso and the Soft Thresholding.
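As a small illustration of the limiting rule in Remark 8, the sketch below applies univariate soft thresholding coordinate by coordinate. The function names and the numerical values are hypothetical; the inputs stand for the inner products $Y'\xi_j$.

```python
def soft_threshold(z, lam):
    """Univariate soft thresholding: (|z| - lam/2)_+ * sign(z)."""
    magnitude = max(abs(z) - lam / 2.0, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude

def ust_estimate(inner_products, lam):
    """Apply the rule to each inner product separately, ignoring correlations."""
    return [soft_threshold(z, lam) for z in inner_products]

print(ust_estimate([2.0, -0.3, 0.5, -1.5], lam=1.0))   # [1.5, 0.0, 0.0, -1.0]
```

Each coordinate is treated on its own: a coefficient is kept only when its inner product exceeds $\lambda/2$ in absolute value, and it is then shrunk by $\lambda/2$.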



7 Extension and comparison 

All results obtained in the present paper can be generalized to all penalized least squares estimators for which the penalty term can be written as

$$\mathrm{pen}(\beta) = \lambda|\beta|_1 + \mu\,\beta' M \beta, \qquad (20)$$

where $M$ is a $p \times p$ matrix. In particular, our study can be extended, for instance, to the Elastic-Net estimator with the special choice $M = I_p$. Such an observation underlines the superiority of the S-Lasso estimator over the Elastic-Net in some situations. Indeed, let us consider the variable selection consistency in the high dimensional setting (cf. Section 4.2). Regarding the Elastic-Net, Assumption (A3) becomes

• Assumption (A3-EN). We assume that

$$\max_{j,k \in \{1,\dots,p\},\ k \neq j} \left|(C_n)_{j,k} + \mu_n (I_p)_{j,k}\right| \le \frac{1}{3\alpha|\mathcal A^*|}. \qquad (21)$$

Since the identity matrix is diagonal and since the maximum in (21) is taken over indexes $k \neq j$, condition (21) reduces to $\max_{j,k \in \{1,\dots,p\},\ k \neq j} |(C_n)_{j,k}| \le \frac{1}{3\alpha|\mathcal A^*|}$. This makes Assumption (A3-EN) similar to the assumption needed to get the variable selection consistency of the Lasso estimator [2]. Hence, in our framework, there is no gain in using the Elastic-Net from a variable selection consistency point of view. This leads us to think that the S-Lasso outperforms the Elastic-Net, at least on examples such as the one in Remark 6. Recently, Jia and Yu [13] studied the variable selection consistency of the Elastic-Net under an assumption called the Elastic Irrepresentable Condition:

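• (EIC). There exists a positive constant $\theta$ such that for any $j \in (\mathcal A^*)^c$

$$\left| C_{j,\mathcal A^*}\left(C_{\mathcal A^*,\mathcal A^*} + \mu I_{|\mathcal A^*|}\right)^{-1}\left(2^{-1}\,\mathrm{Sgn}(\beta^*_{\mathcal A^*}) + \frac{\mu}{\lambda}\,\beta^*_{\mathcal A^*}\right) \right| \le 1 - \theta.$$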
This condition can be seen as a generalization of the Irrepresentable Condition involved in the Lasso variable selection consistency.

Let us discuss how the two assumptions can be compared in the case $p \gg n$. First, note that Assumption (A3-EN), as well as the EIC, suggests low correlations between covariates. Moreover, Assumptions (A1), (A4) and (A3-EN) seem more restrictive than the EIC, as all the correlations are constrained in (21). However, the EIC is harder to interpret in terms of the coefficients of the regression vector $\beta^*$. It also depends on the sign of $\beta^*$. The main difference is that the consistency result in the present paper holds uniformly over the solutions of the Elastic-Net criterion, while the result from [13] hinges upon the existence of a consistent solution for variable selection. Obviously, this is more restrictive as we are certain to provide the sign-consistent solution under the EIC. Finally, we have also provided results on the sup-norm and sparsity inequalities on the squared risk of our estimators. Such results are new for estimators defined with the penalty (20), including the S-Lasso and the Elastic-Net.
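To make the role of the matrix $M$ in the penalty (20) concrete, here is a minimal sketch comparing the Elastic-Net choice ($M = I_p$) with the S-Lasso fusion choice on a toy vector. It assumes the penalty has the form $\lambda|\beta|_1 + \mu\,\beta' M \beta$; the function names and the numerical values are hypothetical.

```python
def ridge_quadratic(beta):
    """beta' I beta: the Elastic-Net choice M = I_p."""
    return sum(b * b for b in beta)

def fusion_quadratic(beta):
    """beta' J beta = sum_j (beta_j - beta_{j-1})^2: the S-Lasso choice."""
    return sum((beta[j] - beta[j - 1]) ** 2 for j in range(1, len(beta)))

def penalty(beta, lam, mu, quadratic):
    """l1 term plus mu times a quadratic form (assumed shape of the penalty)."""
    return lam * sum(abs(b) for b in beta) + mu * quadratic(beta)

beta = [1.0, 1.0, 1.0, 0.0]           # three equal successive coefficients
print(penalty(beta, lam=1.0, mu=2.0, quadratic=fusion_quadratic))   # 5.0
print(penalty(beta, lam=1.0, mu=2.0, quadratic=ridge_quadratic))    # 9.0
```

On this vector the fusion form only charges the single jump between the third and fourth coordinates, whereas the identity form charges every non-zero coefficient.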

8 Experimental results 

In the present section we illustrate the good prediction and selection properties of the NS-Lasso and the HS-Lasso estimators. For this purpose, we compare them to the Lasso and the Elastic-Net. It appears that the S-Lasso is a good challenger to the Elastic-Net [36], even when large correlations between covariates exist. We further show that in most cases, our procedure outperforms the Elastic-Net and the Lasso when we consider the ratio between the number of relevant selected covariates and the number of irrelevant selected covariates.

Simulations: 

Data. Four simulations are generated according to the linear regression model

$$y = x\beta^* + \sigma\varepsilon, \quad \varepsilon \sim \mathcal N(0, 1), \quad x = (\xi_1, \dots, \xi_p) \in \mathbb R^p.$$

The first and the second examples were introduced in the original Lasso paper [25]. The third simulation creates a grouped covariates situation. It was introduced in [36] and aims to highlight the efficiency of the Elastic-Net compared to the Lasso. The last simulation introduces large correlations between successive covariates.
(a) In this example, we simulate 20 observations with 8 covariates. The true regression vector is $\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0)'$, so that only three covariates are truly relevant. Let $\sigma = 3$ and let the correlation between $\xi_j$ and $\xi_k$ be such that

$$\mathrm{Cov}(\xi_j, \xi_k) = 2^{-|j-k|}.$$

(b) The second example is the same as the first one, except that we generate 50 observations and that $\beta^*_j = 0.85$ for every $j \in \{1, \dots, 8\}$, so that all the covariates are relevant.

(c) In the third example, we simulate 50 observations with 40 covariates. The true regression vector is such that $\beta^*_j = 3$ for $j = 1, \dots, 15$ and $\beta^*_j = 0$ for $j = 16, \dots, 40$. Let $\sigma = 15$ and let the covariates be generated as follows:

$$\xi_j = Z_1 + \varepsilon^x_j, \quad Z_1 \sim \mathcal N(0, 1), \quad j = 1, \dots, 5,$$
$$\xi_j = Z_2 + \varepsilon^x_j, \quad Z_2 \sim \mathcal N(0, 1), \quad j = 6, \dots, 10,$$
$$\xi_j = Z_3 + \varepsilon^x_j, \quad Z_3 \sim \mathcal N(0, 1), \quad j = 11, \dots, 15,$$

where the $\varepsilon^x_j$, $j = 1, \dots, 15$, are i.i.d. $\mathcal N(0, 0.01)$ variables. Moreover, for $j = 16, \dots, 40$, the $\xi_j$'s are i.i.d. $\mathcal N(0, 1)$ variables.

(d) In the last example, we generate 50 observations with 30 covariates. The true regression vector is such that

$$\beta^*_j = 3 - 0.1j, \quad j = 1, \dots, 10,$$
$$\beta^*_j = -5 + 0.3j, \quad j = 20, \dots, 25,$$
$$\beta^*_j = 0 \quad \text{for the other } j.$$

The noise is such that $\sigma = 9$ and the correlations are such that $\mathrm{Cov}(\xi_j, \xi_k) = \exp\left(-\frac{|j-k|}{2}\right)$ for $(j, k) \in \{11, \dots, 25\}^2$, and the other covariates are i.i.d. $\mathcal N(0, 1)$, also independent of $\xi_{11}, \dots, \xi_{25}$. In this model there are large correlations between relevant covariates and even between relevant and irrelevant covariates.
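As an illustration of how such data can be produced, the sketch below generates one dataset for example (a). It is a sketch under assumptions not stated in the text: the covariates are drawn as centered Gaussian vectors with the prescribed covariance through a Cholesky factor, and the function name `simulate_example_a` and the seed are hypothetical.

```python
import numpy as np

def simulate_example_a(n=20, sigma=3.0, seed=0):
    """One draw of example (a): 8 covariates with Cov(xi_j, xi_k) = 2^{-|j-k|}."""
    rng = np.random.default_rng(seed)
    beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta_star.size
    idx = np.arange(p)
    cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # 2^{-|j-k|}
    chol = np.linalg.cholesky(cov)                     # cov = chol @ chol.T
    X = rng.standard_normal((n, p)) @ chol.T           # rows have covariance cov
    y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, y, beta_star, cov

X, y, beta_star, cov = simulate_example_a()
print(X.shape, y.shape)
```

The other examples follow the same pattern with their own regression vector, noise level and covariance structure.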

Validation. The selection of the tuning parameters $\lambda$ and $\mu$ is based on the minimization of a BIC-type criterion [22]. For a given $\hat\beta$, the associated BIC error is defined as:

$$\mathrm{BIC}(\hat\beta) = \|Y - X\hat\beta\|_n^2 + \frac{\log(n)\,\sigma^2}{n}\, df(\hat\beta),$$

where $df(\hat\beta)$ is given by (17) if we consider the S-Lasso, and denotes its analogous quantities if we consider the Lasso or the Elastic-Net. Such a criterion provides an
accurate estimator which enjoys good variable selection properties ([23] and [30]). In simulation studies, for each replication, we also provide the Mean Square Error (MSE) of the selected estimator on a new and independent dataset with the same size as the training set (that is $n$). This gives information on the robustness of the procedures.

Table 1: Mean of the number of non-zero coefficients [and its standard error] selected respectively by the Lasso, the Elastic-Net (E-Net), the Normalized Smooth Lasso (NS-Lasso) and the Highly Normalized Smooth Lasso (HS-Lasso) procedures.

Method      Example (a)    Example (b)    Example (c)    Example (d)
Lasso       3.8 [±0.1]     6.5 [±0.1]     6 [±0.1]       18.4 [±0.2]
E-Net       4.9 [±0.1]     6.9 [±0.1]     15.9 [±0.1]    20.5 [±0.2]
NS-Lasso    3.9 [±0.1]     6.5 [±0.1]     15.3 [±0.2]    18.9 [±0.2]
HS-Lasso    3.5 [±0.1]     5.9 [±0.1]     15 [±0.1]      18.1 [±0.2]
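The validation step can be sketched as follows. This is only an illustration, assuming that the BIC-type criterion is the squared empirical norm of the residuals plus $\log(n)\sigma^2/n$ times the degrees of freedom; the function names, the candidate labels and the degrees of freedom passed in are hypothetical inputs, not quantities computed by the paper's estimator of the degrees of freedom.

```python
import math

def bic_error(y, fitted, df, sigma):
    """Assumed BIC-type error: ||y - fitted||_n^2 + log(n) * sigma^2 / n * df."""
    n = len(y)
    rss_n = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
    return rss_n + math.log(n) * sigma ** 2 / n * df

def select_by_bic(candidates, y, sigma):
    """candidates: list of (label, fitted_values, df); returns the best label."""
    return min(candidates, key=lambda c: bic_error(y, c[1], c[2], sigma))[0]

y = [1.0, 2.0, 3.0, 4.0]
candidates = [
    ("small model", [2.5, 2.5, 2.5, 2.5], 1),   # poor fit, few degrees of freedom
    ("full model", [1.0, 2.0, 3.0, 4.0], 4),    # perfect fit, many degrees of freedom
]
print(select_by_bic(candidates, y, sigma=1.0))
print(select_by_bic(candidates, y, sigma=3.0))
```

A larger noise level increases the weight of the degrees of freedom term, so the criterion moves toward the smaller model.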

Interpretations. All the results presented here are based on 200 replications. Figure 1 and Figure 2 give respectively the BIC error and the test error of the considered procedures in each example. Regarding selection, Figure 3 shows the frequencies of selection of each covariate for all the procedures, and Table 1 shows the mean of the number of non-zero coefficients that each procedure selected. Finally, for each procedure, Table 2 gives the ratio between the number of relevant covariates and the number of noise covariates that the procedure selected. Let us call this ratio SNR. Then we can express this ratio as

$$\mathrm{SNR} = \frac{\sum_{j \in \mathcal A^*} \mathbf 1(\hat\beta_j \neq 0)}{\sum_{j \notin \mathcal A^*} \mathbf 1(\hat\beta_j \neq 0)}.$$

This is a good indication of the selection power of the procedures.
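The ratio described above, the number of selected relevant covariates over the number of selected noise covariates, can be computed as in the sketch below. The function name `selection_snr`, the convention of returning infinity when no noise covariate is selected, and the numerical values are assumptions made for illustration.

```python
import math

def selection_snr(beta_hat, beta_star):
    """Selected relevant covariates divided by selected noise covariates."""
    relevant = sum(1 for bh, bs in zip(beta_hat, beta_star) if bs != 0 and bh != 0)
    noise = sum(1 for bh, bs in zip(beta_hat, beta_star) if bs == 0 and bh != 0)
    if noise == 0:
        return math.inf          # assumed convention: no noise covariate selected
    return relevant / noise

beta_star = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
beta_hat = [2.1, 0.4, 0.0, 0.3, 1.2, 0.0, 0.0, -0.1]
print(selection_snr(beta_hat, beta_star))   # 3 relevant / 2 noise = 1.5
```

This also explains why example (b) has no SNR: all its covariates are relevant, so the denominator is always zero.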

As the Lasso is a special case of the S-Lasso and the Elastic-Net, the Lasso BIC error (Figure 1) is always larger than the BIC error for the other methods. These two seem to have equivalent BIC errors. When considering the test error (Figure 2), it seems again that all the procedures are similar in all of the examples. They manage to produce good predictions independently of the sparsity of the model.

The most interesting aspect concerns variable selection. For this purpose we treat each example separately.

Example (a): the Elastic-Net selects a model which is too large (Table 1). This is reflected by the worst SNR (Table 2). As a consequence, we can observe in Figure 3 that it also includes the second covariate more often than the other procedures. This is due to the "grouping effect", as the first covariate is relevant. For similar reasons, the S-Lasso often selects the second covariate. However, this covariate is selected less often than by the Elastic-Net, as the S-Lasso seems to be slightly disturbed by the third covariate, which is irrelevant. This aspect of the S-Lasso procedure is also present in the selection of covariate 5, as its neighbor covariates 4 and 6 are irrelevant. We can also observe that the S-Lasso procedure is the one which selects irrelevant covariates least often when these covariates are far away from relevant ones (in terms of index distance). Finally, even if the Lasso procedure selects the relevant covariates less often than the Elastic-Net and the S-Lasso procedures, it has a comparably good SNR. The Lasso presents good selection performance in this example.

Example (b): we can see in Figure 3 how the S-Lasso and Elastic-Net selection depends on how the covariates are ranked. Compared to the Lasso, they both select the covariates in the middle (that is, covariates 2 to 7) more often than the ones at the borders (covariates 1 and 8). We also remark that this aspect is more pronounced for the S-Lasso than for the Elastic-Net.

Example (c): the Lasso procedure performs poorly. It selects more noise covariates and fewer relevant ones than the other procedures (Figure 3). It also has the worst SNR (Table 2). In this example, Figure 3 also shows that the Elastic-Net selects relevant covariates more often than the S-Lasso procedures, but it also selects more noise covariates than the NS-Lasso procedure. Thus, even if the Elastic-Net has very good performance in variable selection, the NS-Lasso procedure has similar performance with a close SNR (Table 2). The NS-Lasso appears to have very good performance in this example. However, it again selects relevant covariates at the border less often than the Elastic-Net.

Table 2: Mean of the ratio between the number of relevant covariates and the number of noise covariates (SNR) [and its standard error] that each of the Lasso, the Elastic-Net, the NS-Lasso and the HS-Lasso procedures selected.

Method      Example (a)    Example (c)    Example (d)
Lasso       2.3 [±0.1]     2.9 [±0.1]     4.7 [±0.2]
E-Net       1.7 [±0.1]     13.1 [±0.3]    3.4 [±0.2]
NS-Lasso    2.5 [±0.1]     13.5 [±0.3]    6.8 [±0.3]
HS-Lasso    1.79 [±0.1]    11.4 [±0.3]    6.4 [±0.3]

Example (d): we decompose the study into two parts. The first, independent part considers covariates $\xi_1, \dots, \xi_{10}$ and $\xi_{26}, \dots, \xi_{30}$. The second part considers the other covariates, which are dependent. Regarding the independent covariates, Figure 3 shows that all the procedures perform roughly in the same way, though the S-Lasso procedure enjoys a slightly better selection (in both the relevant and the noise groups of covariates). For the dependent and relevant covariates, the Lasso performs worse than the other procedures. It clearly selects these relevant covariates less often. As in example (c), the reason is that the Lasso modification of the LARS algorithm tends to select only one representative of a group of highly correlated covariates. The high value of the SNR for the Lasso (when compared to the Elastic-Net) is explained by its good performance when it treats noise covariates. In this example the Elastic-Net correctly selects relevant covariates, but it is also the procedure which selects the most noise covariates and has the worst SNR. We also note that both the NS-Lasso and HS-Lasso outperform the Lasso and Elastic-Net. This gain is especially pronounced in the center of the groups. Observe that for the covariates $\xi_{20}$, $\xi_{21}$, $\xi_{25}$ and $\xi_{26}$ (that is, the borders), the NS-Lasso and HS-Lasso have slightly worse performance than in the center of the groups. This is again due to the attraction imposed by the fusion penalty (2) in the S-Lasso criterion.

Figure 1: BIC error in each example. For each plot, we construct the boxplot for the procedure 1 = Lasso; 2 = Elastic-Net; 3 = NS-Lasso; 4 = HS-Lasso.

Figure 2: Test Error in each example. For each plot, we construct the boxplot for the procedure 1 = Lasso; 2 = Elastic-Net; 3 = NS-Lasso; 4 = HS-Lasso.

Conclusion of the experiments. The S-Lasso procedure seems to meet our expectations. Indeed, when successive correlations exist, it tends to select the whole group of these relevant covariates and not only one representative of the group, as done by the Lasso procedure. It also appears that the S-Lasso procedure has very good selection properties with respect to both relevant and noise covariates. However, it has slightly worse performance at the borders than in the centers of groups of covariates (due to the attraction of irrelevant covariates). It almost always has a better SNR than the Elastic-Net, so we can regard it as a good challenger to this procedure.
Figure 3: Number of covariate detections for each procedure (Lasso, Elastic-Net, NS-Lasso, HS-Lasso) in all the examples (Top-Left: Example (a); Top-Right: Example (b); Bottom-Left: Example (c); Bottom-Right: Example (d)).
9 Conclusion

In this paper, we introduced a new procedure called the Smooth-Lasso which takes into account correlation between successive covariates. We established several theoretical results. The main conclusions are that when $p < n$, the S-Lasso is consistent in variable selection and asymptotically normal with a rate lower than $\sqrt n$. In the high dimensional setting, we provided a condition related to the mutual coherence condition, under which the thresholded version of the Smooth-Lasso is consistent in variable selection. This condition is fulfilled when correlations between successive covariates exist. Moreover, simulation studies showed that normalized versions of the Smooth-Lasso have nice variable selection properties, which are emphasized when high correlations exist between successive covariates. It appears that the Smooth-Lasso almost always outperforms the Lasso and is a good challenger to the Elastic-Net.



Appendix A.

Since the matrix $C_n + \mu_n J$ plays a crucial role in the proofs, we shorten the notation by writing $K_n = C_n + \mu_n J$, and when $p < n$ we define $K = C + \mu J$, its limit. In this appendix we prove the results when $p < n$.

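Proof of Theorem 1. Let $\Psi_n$ be

$$\Psi_n(u) = \|Y - X(\beta^* + v_n u)\|_n^2 + \lambda_n \sum_{j=1}^{p} |\beta^*_j + v_n u_j| + \mu_n \sum_{j=2}^{p} \left(\beta^*_j - \beta^*_{j-1} + v_n(u_j - u_{j-1})\right)^2,$$

for $u = (u_1, \dots, u_p)' \in \mathbb R^p$, and let $\hat u_n = \operatorname{Argmin}_u \Psi_n(u)$. Let $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$; we then have

$$\Psi_n(u) - \Psi_n(0) = v_n^2\, u'\left(\frac{X'X}{n}\right)u - 2 v_n \frac{\varepsilon' X}{n}\, u + \lambda_n \sum_{j=1}^{p} \left(|\beta^*_j + v_n u_j| - |\beta^*_j|\right) + \mu_n \sum_{j=2}^{p} \left\{ \left(\beta^*_j - \beta^*_{j-1} + v_n(u_j - u_{j-1})\right)^2 - (\beta^*_j - \beta^*_{j-1})^2 \right\}$$

$$= v_n^2 \left[ u'\left(\frac{X'X}{n}\right)u - \frac{2}{v_n\sqrt n}\, \frac{\varepsilon' X}{\sqrt n}\, u + \frac{\lambda_n}{v_n}\, v_n^{-1} \sum_{j=1}^{p} \left(|\beta^*_j + v_n u_j| - |\beta^*_j|\right) + \frac{\mu_n}{v_n}\, v_n^{-1} \sum_{j=2}^{p} \left\{ \left(\beta^*_j - \beta^*_{j-1} + v_n(u_j - u_{j-1})\right)^2 - (\beta^*_j - \beta^*_{j-1})^2 \right\} \right]$$

$$= v_n^2\, V_n(u).$$

Note that $\hat u_n = \operatorname{Argmin}_u \Psi_n(u) = \operatorname{Argmin}_u V_n(u)$; we then have to consider the limit distribution of $V_n(u)$. First, we have $X'X/n \to C$. Moreover, as $1/(v_n \sqrt n) \to \kappa$ and as, given $X$, the random variable $\varepsilon' X/\sqrt n \xrightarrow{D} W$, with $W \sim \mathcal N(0, \sigma^2 C)$, the Slutsky theorem implies that

$$\frac{2}{v_n \sqrt n}\, \frac{\varepsilon' X}{\sqrt n}\, u \xrightarrow{D} 2\kappa W'u.$$

Now we treat the last two terms. If $\beta^*_j \neq 0$,

$$v_n^{-1}\left(|\beta^*_j + v_n u_j| - |\beta^*_j|\right) \to u_j\, \mathrm{Sgn}(\beta^*_j),$$

and is equal to $|u_j|$ otherwise. Then, as $\lambda_n / v_n \to \lambda$,

$$\frac{\lambda_n}{v_n}\, v_n^{-1} \sum_{j=1}^{p} \left(|\beta^*_j + v_n u_j| - |\beta^*_j|\right) \to \lambda \sum_{j=1}^{p} \left\{ u_j\, \mathrm{Sgn}(\beta^*_j)\, \mathbf 1(\beta^*_j \neq 0) + |u_j|\, \mathbf 1(\beta^*_j = 0) \right\}.$$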

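For the remaining term, we show that if $\beta^*_j \neq \beta^*_{j-1}$,

$$v_n^{-1} \left\{ \left(\beta^*_j - \beta^*_{j-1} + v_n(u_j - u_{j-1})\right)^2 - (\beta^*_j - \beta^*_{j-1})^2 \right\} \to 2(u_j - u_{j-1})(\beta^*_j - \beta^*_{j-1}),$$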

and is equal to otherwise. But \i n converge to 0, implies that 



3=2 
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3=2 

Therefore we have $V_n(u) \to V(u)$ in probability, for every $u \in \mathbb R^p$. Since $C$ is a positive definite matrix, $V(u)$ has a unique minimizer. Moreover, as $V_n(u)$ is convex, standard M-estimation results [28] lead to: $\hat u_n \to \operatorname{Argmin}_u V(u)$. □

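Proof of Theorem 2. We begin by giving two results which we will use in our proof. The first one concerns the optimality conditions of the S-Lasso estimator. Recall that by definition

$$\hat\beta^{SL} = \operatorname{Argmin}_{\beta \in \mathbb R^p} \|Y - X\beta\|_n^2 + \lambda_n |\beta|_1 + \mu_n \beta' J \beta.$$

Denote by $f(a)|_{a = a_0}$ the evaluation of the function $f$ at the point $a_0$. As the above problem is a non-differentiable convex problem, classical tools lead to the following optimality conditions for the S-Lasso estimator:

Lemma 3. The vector $\hat\beta^{SL} = (\hat\beta^{SL}_1, \dots, \hat\beta^{SL}_p)'$ is the S-Lasso estimate as defined in (2)-(3) if and only if

$$\frac{\partial \left( \|Y - X\beta\|_n^2 + \mu_n \beta' J \beta \right)}{\partial \beta_j}\bigg|_{\beta = \hat\beta^{SL}} = -\lambda_n\, \mathrm{Sgn}(\hat\beta^{SL}_j) \quad \text{for } j : \hat\beta^{SL}_j \neq 0, \qquad (22)$$

$$\left| \frac{\partial \left( \|Y - X\beta\|_n^2 + \mu_n \beta' J \beta \right)}{\partial \beta_j}\bigg|_{\beta = \hat\beta^{SL}} \right| \le \lambda_n \quad \text{for } j : \hat\beta^{SL}_j = 0. \qquad (23)$$

Recall that $\mathcal A^* = \{j : \beta^*_j \neq 0\}$. The second result states that if we restrict ourselves to the covariates which we are after (i.e. indexes in $\mathcal A^*$), we get a consistent estimate as soon as the regularization parameters $\lambda_n$ and $\mu_n$ are properly chosen.

Lemma 4. Let $\tilde\beta_{\mathcal A^*}$ be a minimizer of

$$\|Y - X_{\mathcal A^*} \beta_{\mathcal A^*}\|_n^2 + \lambda_n \sum_{j \in \mathcal A^*} |\beta_j| + \mu_n\, \beta_{\mathcal A^*}' J_{\mathcal A^*, \mathcal A^*} \beta_{\mathcal A^*}.$$

If $\lambda_n \to 0$ and $\mu_n \to 0$, then $\tilde\beta_{\mathcal A^*}$ converges to $\beta^*_{\mathcal A^*}$ in probability.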
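This lemma can be seen as a special and restricted case of Theorem 1. We now prove Theorem 2. Let $\tilde\beta_{\mathcal A^*}$ be as in Lemma 4. We define an estimator $\tilde\beta$ by extending $\tilde\beta_{\mathcal A^*}$ by zeros on $(\mathcal A^*)^c$. Hence, consistency of $\tilde\beta$ is ensured as a simple consequence of Lemma 4. Now we need to prove that, with probability tending to one, this estimator is optimal for the problem (2)-(3), that is, the optimality conditions (22)-(23) are fulfilled with probability tending to one.

From now on, we write $\mathcal A$ for $\mathcal A^*$. By definition of $\tilde\beta_{\mathcal A}$, the optimality condition (22) is satisfied. We now must check the optimality condition (23). Combining the fact that $Y = X\beta^* + \varepsilon$ and the convergence of the matrix $X'X/n$ and the vector $\varepsilon' X/\sqrt n$, we have

$$n^{-1}\left(X'Y - X'X_{\cdot,\mathcal A}\tilde\beta_{\mathcal A}\right) = C_{\cdot,\mathcal A}(\beta^*_{\mathcal A} - \tilde\beta_{\mathcal A}) + O_p(n^{-1/2}). \qquad (24)$$

Moreover, the optimality condition (22) for the estimator $\tilde\beta$ can be written as

$$n^{-1}\left(X_{\cdot,\mathcal A}'Y - X_{\cdot,\mathcal A}'X_{\cdot,\mathcal A}\tilde\beta_{\mathcal A}\right) = \frac{\lambda_n}{2}\,\mathrm{Sgn}(\tilde\beta_{\mathcal A}) - \mu_n J_{\mathcal A,\mathcal A}(\beta^*_{\mathcal A} - \tilde\beta_{\mathcal A}) + \mu_n J_{\mathcal A,\mathcal A}\beta^*_{\mathcal A}. \qquad (25)$$

Combining (24) and (25), we easily obtain

$$(\beta^*_{\mathcal A} - \tilde\beta_{\mathcal A}) = \left(C_{\mathcal A,\mathcal A} + \mu_n J_{\mathcal A,\mathcal A}\right)^{-1}\left(\frac{\lambda_n}{2}\,\mathrm{Sgn}(\tilde\beta_{\mathcal A}) + \mu_n J_{\mathcal A,\mathcal A}\beta^*_{\mathcal A}\right) + O_p(n^{-1/2}).$$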

Since (3 is consistent and \ n rt}l 2 — > oo, for each j E A c , the left hand side in the 
optimality condition ( !23i) 

^-(gy - CjX., A p A ) - ^\aPa =■■ 

converges in probability to 

C^KaaY 1 Sgn(^) + jJaaPa) ~ jJjaPa =■ L r 
By condition ([6|), this quantity is strictly smaller than one. Then 

lim P (Vj G A c , |4 n) | < l) > FT P (I L ?I < 1) = 1, 

n— >oo \ J / -I- 

jeA c 

which ends the proof. □ 
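Proof of Theorem. We prove the theorem by contradiction, assuming that there exists a $j \in (\mathcal A^*)^c$ such that there exists an $i \in \mathcal A^*$ and

$$|\Omega_j(\lambda, \mu, \mathcal A^*, \beta^*)| > 1,$$

where the $\Omega_j$ are given by (5). Since $\hat{\mathcal A}_n = \mathcal A^*$ with probability tending to one, optimality condition (22) implies

$$\hat\beta^{SL}_{\mathcal A} = \left((K_n)_{\mathcal A,\mathcal A}\right)^{-1}\left(\frac{X_{\cdot,\mathcal A}'Y}{n} - \frac{\lambda_n}{2}\,\mathrm{Sgn}(\hat\beta^{SL}_{\mathcal A})\right). \qquad (26)$$

Using this expression of $\hat\beta^{SL}_{\mathcal A}$ and $Y = X_{\cdot,\mathcal A}\beta^*_{\mathcal A} + \varepsilon$, then for every $j \in \mathcal A^c$,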

= {{J^n)A,A> 

n n n n n 

+y^((^n)A^)- 1 Sgn(/3f) 

£*Y CaX a , -X^ aE C'^X a 

n n n n 



Therefore, 
with 



+2L _^ {{Kn)A A) -L Sgn(^) + /i n J A ^ 

We treat this two terms separately. First as fi A L converges in probability to j3 A and 
empirical covariance matrices convergence, the sequence B n / X n converges to 

B = C jA {K AA )-\2- 1 \ Sgn(^) + ^JaaPa) ~ ^-%aPa- 

By assumption \B\ > 1. This implies that P (B n /\ n > (1 + \B\)/2) converges to one. 
With regard to the other term, since Y = Xf3* + e we have 

n n n 

l/2\ 



fc=i 



fc=l 

n 



k=l 
where the $c_k$ are i.i.d. random variables with mean 0 and variance:

s 2 = Yar(c k ) = E(c 2 k ) = E[E(c 2 h \X)\ 

= E [E(s 2 \X)(x kJ - CUKAAr^'kA?] 

= a 2 E [Cjj + C j ^(K AtA y 1 Cj L ^(K AtA )~ 1 CA,j 

Thus, by the central limit theorem, n l l 2 C n is asymptotically normal with mean 
and covariance matrix s 2 /n, which is finite. Thus ¥(n l l 2 A n > 0) converges to 1/2. 

Finally, $P\left((A_n + B_n)/\lambda_n > (1 + |B|)/2\right)$ is asymptotically bounded below by 1/2. Thus $|(A_n + B_n)/\lambda_n|$ is asymptotically bigger than 1 with a positive probability, that is to say, the optimality condition (23) is not satisfied. Then $\hat\beta^{SL}$ is not optimal. We get a contradiction, which concludes the proof. □

Appendix B. 

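In this appendix we mainly prove the results when $p > n$.

Proof of Theorem. Using the definition of the penalized estimator (2)-(3), for any $\beta \in \mathbb R^p$, we have

$$\|X\hat\beta^{SL} - X\beta^*\|_n^2 - \frac{2}{n}\sum_{i=1}^{n} \varepsilon_i x_i \hat\beta^{SL} + \lambda_n |\hat\beta^{SL}|_1 + \mu_n (\hat\beta^{SL})' J \hat\beta^{SL} \le \|X\beta - X\beta^*\|_n^2 - \frac{2}{n}\sum_{i=1}^{n} \varepsilon_i x_i \beta + \lambda_n |\beta|_1 + \mu_n \beta' J \beta.$$

Therefore, if we choose $\beta = \beta^*$, we obtain the following inequalities:

$$\|X\hat\beta^{SL} - X\beta^*\|_n^2 \le \lambda_n \sum_{j=1}^{p} \left(|\beta^*_j| - |\hat\beta^{SL}_j|\right) + \frac{2}{n}\sum_{i=1}^{n} \varepsilon_i x_i (\hat\beta^{SL} - \beta^*) + \mu_n \left(\beta^{*\prime} J \beta^* - (\hat\beta^{SL})' J \hat\beta^{SL}\right)$$

$$\le \lambda_n \sum_{j=1}^{p} \left(|\beta^*_j| - |\hat\beta^{SL}_j|\right) + \frac{2}{n}\sum_{i=1}^{n} \varepsilon_i x_i (\hat\beta^{SL} - \beta^*) + \mu_n \beta^{*\prime} J \beta^*, \qquad (27)$$

as $\beta' J \beta \ge 0$ for any $\beta \in \mathbb R^p$. In order to control (27), we first use Assumption (A1), so that $\mu_n \beta^{*\prime} J \beta^* \le L_1 \kappa_2 \sigma^2 \frac{\log(p)|\mathcal A^*|}{n}$. Second, we bound the residual term $\frac{2}{n}\sum_{i=1}^{n} \varepsilon_i x_i (\hat\beta^{SL} - \beta^*)$ in the same way as in [4]. Then, we only present here the main lines. Recall that $\mathcal A = \mathcal A^* = \{j : \beta^*_j \neq 0\}$. Then, on the event $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 4|V_j| \le \lambda_n\}$ with $V_j = n^{-1}\sum_{i=1}^{n} x_{i,j}\varepsilon_i$, we have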



\\XP SL - XP\\l + 2- l \ n < X n E W - Pj 

3=1 jeA 



2 \og(p)\A\ 



n 



.(28) 



This inequality is obtained thanks to the fact that \/3* - Pf L \ + \/3*\ - \Pj \ = 
for any j ^ A and to the triangular inequality. The rest of the proof consists in 



bounding this term X n ^ gyl 



Pf L ~ ft 



Using similar arguments as in [4], we can 



write 



£(/3f -P*) 2 < \\XP SL -Xp*\\l + 2 Pl J2\P S k L -Pl\J2\^ L -^\ 



jeA 



keA 



3=1 



-Pi 



E \M L - % 

\jeA 



(29) 



But (J2 jeA /3f -P* < \A\ E^aWj ~ then 



\jeA / 

< \A\ \\\Xp SL -Xp*\\ 2 n + 2 Pl J20 S k L -^lE^f 



keA 

-pi (ei/¥ l 

A simple optimization implies 

2prN E? =1 l^f 



(30) 



< 



l + Pi\A\ 



+ 



l + Pi\A\ 



3*112 



(31) 



Now, use Assumption (A2) to bound the left hand side of the inequality (l3Pj) and 
combine this to (l28l) to get 



iix/3 5L -x^i|2 + A n E|/3f 



i2 , , . ,log(p)|.A| 



< 16\Z\A\ + L lK2 a 2 



n 



(32) 
This proves (11). Finally, (12) follows directly by dividing both sides of this last inequality by $\lambda_n$. A concentration inequality bounding $P(\max_{j=1,\dots,p} 4|V_j| \le \lambda_n)$ allows us to conclude the proof. □

Lemma 5. Let $\Lambda_{n,p}$ be the random event defined by $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 4|V_j| \le \lambda_n\}$, where $V_j = n^{-1}\sum_{i=1}^{n} x_{i,j}\varepsilon_i$. Let us choose a $\kappa_1 > 2\sqrt 2$ and $\lambda_n = \kappa_1 \sigma \sqrt{n^{-1}\log(p)}$. Then

$$P\left(\max_{j=1,\dots,p} 4|V_j| \le \lambda_n\right) \ge 1 - p^{1 - \kappa_1^2/8}.$$



Proof. Since Vj ~ A/"(0,n V 2 ), an elementary Gaussian inequality gives 



P 



( max X'^VA > A' 1 ) < p max ¥ (X^lVA > A" 1 ) 
\j=h-,p J j=h-,p 



< pexp (— k\ log(p)/8) 
This ends the proof. □ 



Proof of Theorem. Throughout this proof, for any $a \in \mathbb R^p$, let us denote by $a_{\mathcal A}$ the p-dimensional vector such that $(a_{\mathcal A})_j = a_j$ if $j \in \mathcal A$ and zero otherwise. Moreover, we recall that $K_n = C_n + \mu_n J$. Now, note that we can write the KKT conditions (22)-(23) as

$$\left\| K_n(\hat\beta^{SL} - \beta^*) - \frac{X'\varepsilon}{n} + \mu_n J\beta^* \right\|_\infty \le \frac{\lambda_n}{2}. \qquad (33)$$

Recall that $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 2|V_j| \le \lambda_n\}$ with $V_j = \xi_j'\varepsilon/n$; then, applying (33) and Assumption (A4), we have on $\Lambda_{n,p}$ and for any $j \in \{1, \dots, p\}$

\{K n )^ L - (3*)\ = \{K n SL - (3*)}, - j2(K n ) jjk s k L - (3D + n n {J(3*)i\ 

k=l 

- Y + + £ \( K »)M L - PI) + A*»(Wil 
fc=l 

Then 

- /3*)|U < -p + tt^I/3 5L - + A*»WII«*, (34) 
4 da|yl| 
Let us now bound $|\hat\beta^{SL} - \beta^*|_1$. Thanks to (27), we can write



2 P 



X n SL \i < Xn\/3*\i + - V^(/3 5L - /?*) + // n /3*' J/3* 

n L — ' 

i=i 

^Anl^K < + - /Tlx + /W?*'J>. 

Dividing by A n , and adding 2 _1 |/3 5L — 0*|i — |/3 |i, we get on the event A„ iP 

(35) 

An 

^ii 2 + ^tim*, (36) 

An 

where we used the Cauchy Schwarz inequality in the last line. Combine ( 1341 ) and 
(l36l ). we easily get 





-0*|i 


<(|/3 5L - 


01 1 


1/3^- 


0*|i 


<2|/3* L - 


0*41: 




01 i 


<2v^4j| 





ii0 SL - 0iu < ^ + \m L - p. 



+l^n\\jP*\\oo + ^ Xl P* A JP* A ). (37) 



+ [I 



The final step consists in bounding ||0^ L — /^Jh- First, using the KKT condition (1331 . 
we remark that \\K n ((3 SL - 0*)|| oo < 3A n /4 + J0*||oo on A„ iP . This and equa- 
tion (l36l) lied to 



{(3 bL - (3*)'K n {(3 bL - (3*) < \\K n ((3 bL - \(3 bL - (3*\, 



< (^ + ^|| J(3*\U(2^\A\\\P S A L - (3 A \\ 2 + 2^f3*;jf3 A ) 



(38) 
On the other hand, using Assumption (A4) and similar arguments as in [16],
0% - P A )'K n 0f - ft) 0f - (3 A )> &*g{K n )0 s A L - (3 A ) 



SL _ o* 112 
Pa\\2 



'A 



W S a L ~Pa\\1 
S a L ~ fid'&n - diag(*Q)(/3^ - ft) 



> 1 



> 1 



1 p 



_ 112 
'A PAWl 



m L ~ PA)i\\0 S A L ~ Pa) 



3a\A\ 
3a\A\ - {3* A g 



j,k=i n^- 4 

' SL a* ||2 



where we used in the second inequality the fact that diag(fT n ) has larger diagonal 
elements than 1 since the diagonal elements in C n and J are respectively equal to 1 
and larger than 0. Now, twice using Assumption (A4), one deduces 

SL -P*)'K n SL -p*) > 0SL-(3* A y Kn s A L -(3 A ) | (g#_- P A .)'K n $% - P% 



s '/- ._ p* ii-' 



A 



Mil 2 



ISL _ n* 112 
J A PaWi 



> 1 



> 1 



0a-Pa\ 2 i \Pa - Pa\i\P% - PUi 



3a\A\ \\$f-l3 A \\l 
\P5l L -Pa\1 



3a\A\ W S A L -P M2 



?* 112 



2/i n ftJft -A 



> fl- 



1 2/i n ft'jft |lf -ft|i 



.4112 



where we used the fact that implies |/3^ - < 2 |/3^ L - ft|i + 2^(3* A Jf3 A 
in the third line. The last inequalities can be summed-up by 



SL -(3*)'K n {^ L -f) > (1 



SL 



!\ || aSL n*\\2 2 / i " Pa JP*A I nSL a* I /oq\ 

-)ll^ -&|| 2 - |& -^li- (39) 



Let us consider (j38l) and (|39l) . An optimization work over — /9^|j 2 provides us 
the following bound: 



'2|a + 2/i n ||J/3*| 



1-41 + 



2/Jn 



3aA nV /|^l| 



(40) 
Thanks to Assumption (A1), $\beta^{*\prime} J \beta^* \le L_1 \log(p)|\mathcal A|$ and $\|J\beta^*\|_\infty \le L_2 \log(p)$. Moreover, the tuning parameters $\lambda_n$ and $\mu_n$ are chosen in the form $\lambda_n = \kappa_1 \sigma \sqrt{\log(p)/n}$ and $\mu_n = \kappa_3 \sigma / n$. Then we conclude from (37) and (40)



jSL _ a*\\ < 1 ( 3 j 1 I 4Lik 3 I 2Lik 3 I / 2Lik 3 , 8Li L 2 k,% x 

> A> ||oo ^ j K30- ^ 4 -r a-1 -r -r -3^3- 1- y 3q ( q _i) k ^ i" g^Zi)^ A n 



This ends the proof. □ 



Proof of Theorem. The proof of this theorem is essentially an adaptation of the one concerning the Lasso in [37]. We do not give the whole proof, but only mention the important steps and refer the reader to [37] for more details. The main points in the proof are Stein's lemma and these few facts:

• For every couple $(\lambda, \mu)$, the S-Lasso estimator is a continuous function of $Y$.

• For every couple $(\lambda, \mu) = \xi$, the active set and the sign vector of $\hat\beta^{SL}_\xi$, which we denote by $\mathrm{Sgn}_\xi$, are piecewise constant with respect to $Y$, outside a set with Lebesgue measure equal to 0.

The detailed proof uses these points and the explicit form of the estimator $\hat\beta^{SL}$ given by (26). This proof is the same as the one in [37], so we omit it here. □